P0-7: instant-api terminationGracePeriodSeconds=35 + preStop (MR-P0-7) #15

Merged: mastermanas805 merged 2 commits into master from ship/p0-7-api-grace-period-2026-05-20 on May 20, 2026


@mastermanas805

Diagnosis

The api Deployment in k8s/app.yaml was missing two pieces needed for graceful shutdown:

  1. terminationGracePeriodSeconds: unset, so it defaulted to 30s. That is
    shorter than the api's drain budget: preStop sleep 5 +
    readinessDrainGrace 3 + ShutdownWithTimeout 25 = 33s of shutdown work.
    Kubelet was SIGKILLing mid-drain.

  2. preStop lifecycle hook — without it, the kubelet sends SIGTERM
    immediately on pod termination. The LB doesn't refresh Service endpoints
    until the readinessProbe fails on the next tick — so new traffic kept
    landing on a pod that was about to stop accepting connections.

Diff Summary

k8s/app.yaml:

  • New terminationGracePeriodSeconds: 35 on the api pod spec
    (budget: preStop 5s + readinessDrainGrace 3s + shutdownTimeout 25s + safety 2s).
  • New lifecycle.preStop.exec.command: ["/bin/sh", "-c", "sleep 5"]
    on the api container — gives the kubelet a window to observe the api's
    /readyz 503 flip (via hooks.Readyz.MarkDraining in the api repo's
    companion PR) and update Service endpoints before SIGTERM is delivered.
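
For orientation, the two additions would sit in the Deployment roughly as sketched below. This is a minimal illustration, not the actual contents of k8s/app.yaml: the Deployment name and namespace come from the verify plan, while the labels, container name, image, probe port, and probe period are assumptions made only to keep the fragment complete.

```yaml
# Sketch only. Fields marked "assumed" do not come from this PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: instant-api            # from the verify plan (deploy/instant-api)
  namespace: instant           # from the verify plan (-n instant)
spec:
  selector:
    matchLabels:
      app: instant-api         # assumed label
  template:
    metadata:
      labels:
        app: instant-api       # assumed label
    spec:
      # Budget: preStop 5s + readinessDrainGrace 3s + shutdownTimeout 25s + safety 2s
      terminationGracePeriodSeconds: 35
      containers:
        - name: api                            # assumed container name
          image: example/instant-api:latest    # placeholder image
          lifecycle:
            preStop:
              exec:
                # Runs before SIGTERM is sent to the container
                command: ["/bin/sh", "-c", "sleep 5"]
          readinessProbe:
            httpGet:
              path: /readyz                    # endpoint named in this PR
              port: 8080                       # assumed port
            periodSeconds: 1                   # assumed probe interval
```

Two details matter when adapting this: the grace-period countdown starts when termination begins, so the preStop sleep counts against the 35s, and an exec hook of this form needs /bin/sh to exist in the container image.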

Required Companion PR

api repo: ship/p0-7-graceful-shutdown-readiness-2026-05-20
adds MarkDraining to /readyz + wires hooks.Readyz.MarkDraining() into the
SIGTERM handler. Must land together — this manifest change alone widens the
grace period but doesn't flip /readyz to 503, so the LB still routes new
traffic to a draining pod.

Live Verify Plan (post-merge)

  1. kubectl apply -f k8s/app.yaml (or whichever path is canonical for infra).
  2. kubectl rollout restart deploy/instant-api -n instant
  3. kubectl describe pod <pod-name> -n instant, run mid-rollout, shows preStop running, then the readiness probe failing, then container exit.
  4. kubectl get events -n instant --sort-by='.lastTimestamp' | tail -20 — no
    'FailedKillPod' / 'killed before terminationGracePeriod' events.

🤖 Generated with Claude Code

mastermanas805 and others added 2 commits May 20, 2026 16:26
…MR-P0-7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new Prometheus alerts tied to the worker repo's PASS 3 enhanced
reasons + PASS 6 stuck-build counters:

- OrphanSweepNoDBRowReap (CRITICAL, 1h): a k8s namespace had no backing
  deployments row — the P0-3 atomic-provision symptom. Pages on first
  occurrence over 1h.

- OrphanSweepStuckBuildSpike (WARNING, 15m): >5 stuck-build flips in 15m
  means the kaniko/GHCR build pipeline is degraded for many customers
  at once.

- OrphanSweepReapFailureRate (WARNING, 30m): the reconciler detected
  orphans it cannot reap (k8s/DB write failure sustained).

The counters land in worker master commit 7d2ff0d; the alerts go live
once the deploy lands + scrape picks them up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 7ad904e into master May 20, 2026
1 check passed